Abstract

The logistic regression model is known to converge to a Poisson point
process model if the binary response becomes infinitely imbalanced. In this
paper, it is shown that this phenomenon is universal in a wide class of link
functions on binomial regression. The proof relies on extreme value theory.
For the logit, probit and complementary log-log link functions, the
intensity measure of the point process becomes an exponential family. For
some other link functions, deformed exponential families appear. A penalized
maximum likelihood estimator for the Poisson point process model is
suggested.
Keywords: binomial regression; extreme value theory; imbalanced data;
Poisson point process; q-exponential family.



1 Introduction 

Let $\{(X_i, Y_i)\}_{i=1}^m$ be $m$ independently and identically distributed
observable data on $\mathbb{R}^p \times \{0, 1\}$. The conditional distribution
of $Y_i$ given $X_i$ is assumed to be

$$P(Y_i = 1 \mid X_i, a, b) = G(a + b^{\top} X_i), \quad a \in \mathbb{R},\ b \in \mathbb{R}^p, \tag{1}$$

where $G(\cdot)$ is a one-dimensional cumulative distribution function. The
inverse function $G^{-1}(p) = \sup\{z : G(z) < p\}$ is the link function in terms
of generalized linear models. Denote the marginal distribution of $X_i$ by
$F(dX_i)$. The distribution function $G$ is typically the logistic, standard
normal or Gumbel distribution. The corresponding link functions are the
logit, probit and complementary log-log functions, respectively. For the
three examples, the log-likelihood function of $(a, b)$ is concave; see
Wedderburn (1976).

Our interest is in the situation where the data are highly imbalanced. In
other words, the probability of success is almost zero. Examples of such
cases are fraud detection, medical diagnosis, political analysis and so
forth. See e.g. Bolton & Hand (2002), Chawla et al. (2004), Jin et al. (2005),
and King & Zeng (2001). For data without covariates, Poisson's law of rare
events is well known: if $P(Y_i = 1) = \lambda/m + o(m^{-1})$, then the
probability distribution of $\sum_{i=1}^m Y_i$ converges to the Poisson
distribution with the mean parameter $\lambda$. From this observation, for
highly imbalanced data, it is natural to consider that the true parameter
$(a, b)$ in (1) depends on $m$, say $(a_m, b_m)$, and $G(a_m) \to 0$ as
$m \to \infty$.
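Poisson's law of rare events can be checked numerically. The following is a minimal sketch, not part of the original analysis: it compares the probability mass function of $\sum_{i=1}^m Y_i$ with $P(Y_i = 1) = \lambda/m$ against the Poisson mass function. The function names and the choice $\lambda = 3$ are illustrative assumptions.

```python
import math


def binomial_pmf(k: int, m: int, p: float) -> float:
    """P(sum of m independent Bernoulli(p) variables equals k)."""
    return math.comb(m, k) * p**k * (1.0 - p) ** (m - k)


def poisson_pmf(k: int, lam: float) -> float:
    """P(Poisson(lam) equals k)."""
    return math.exp(-lam) * lam**k / math.factorial(k)


def max_pmf_gap(m: int, lam: float, k_max: int = 20) -> float:
    """Largest pointwise gap between Binomial(m, lam/m) and Poisson(lam) on 0..k_max."""
    p = lam / m
    return max(
        abs(binomial_pmf(k, m, p) - poisson_pmf(k, lam)) for k in range(k_max + 1)
    )


if __name__ == "__main__":
    # lam = 3.0 is an arbitrary illustrative value.
    for m in (10**2, 10**3, 10**4):
        print(m, max_pmf_gap(m, lam=3.0))
```

The printed gap should shrink as $m$ grows, which is the behaviour the imbalanced asymptotics below rely on.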



Owen (2007) showed that the maximum likelihood estimator of the logistic
regression model converges to that of an exponential family if
$\sum_{i=1}^m Y_i$ is fixed and $m$ goes to infinity. This result is roughly
derived as follows. Consider the model (1) with the logistic distribution
$G(z) = e^z/(1 + e^z)$. Take $a_m(\alpha) = -\log m + \alpha$ and
$b_m(\beta) = \beta$ for any fixed $\alpha$ and $\beta$. Then we obtain



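$$P(Y_i = 1 \mid X_i, a_m(\alpha), b_m(\beta)) = \frac{e^{-\log m + \alpha + \beta^{\top} X_i}}{1 + e^{-\log m + \alpha + \beta^{\top} X_i}} = \frac{e^{\alpha + \beta^{\top} X_i}}{m} + o(m^{-1}) \tag{2}$$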



as $m \to \infty$. By Bayes' theorem, the conditional density of $X_i$ given
$Y_i = 1$ with respect to the distribution $F(dX_i)$ is, at least formally,

$$\frac{e^{\beta^{\top} X_i}}{\int e^{\beta^{\top} x} F(dx)} + o(1). \tag{3}$$

This is an exponential family with the sufficient statistic $x_i$, and Owen's
result follows.



Remark 1. To be precise, Owen (2007) proved the convergence result under
a different setting from here. He assumed that the true conditional
distribution of $X_i$ given $Y_i = j$, $j \in \{0, 1\}$, is any distribution
$F_j$. In our setting, $F_0$ is asymptotically equal to $F$, and the density of
$F_1$ with respect to $F$ should satisfy (3). In other words, our setting
becomes misspecified unless this equality is satisfied. We discuss this
point again in Section 5.



Warton & Shepherd (2010) pointed out that the likelihood of logistic
regression converges to a Poisson point process model with a specific form
of intensity. Indeed, by (2), the probability $P(Y_i = 1, X_i \in A)$ is
approximately $m^{-1} \int_A e^{\alpha + \beta^{\top} x} F(dx)$ for any compact
subset $A$ of $\mathbb{R}^p$. Therefore, by Poisson's law of rare events, the
number of observations $X_i$ for which $X_i \in A$ and $Y_i = 1$ is
approximately distributed according to the Poisson distribution with mean
$\int_A e^{\alpha + \beta^{\top} x} F(dx)$. This is the Poisson point process
with the intensity measure $e^{\alpha + \beta^{\top} x} F(dx)$.

In this paper, we consider the limit of various binomial regression models
other than the logistic model. As expected from the result on logistic
regression, the limit becomes a Poisson point process. A remarkable fact we
prove is that the intensity measure of the point process should be a
q-exponential family for some real number $q$. The q-exponential family,
also called the deformed exponential family or $\alpha$-family, has recently
been much investigated in the literature of statistical physics and
information geometry; see e.g. Amari (1985), Amari & Nagaoka (2000),
Amari & Ohara (2011), Naudts (2002), Naudts (2010), and Tsallis (1988). The
precise definition is given in Section 2. The proof relies on the theory of
extreme values. For example, for the probit or complementary log-log link
functions, the limit of binomial regression is the usual exponential family
as with the logit link. On the other hand, if $G$ is the Cauchy distribution,
then the limit becomes a q-exponential family with $q = 2$. If the uniform
distribution is used, $q = 0$.



As a related work, Ding et al. (2011) introduced the t-logistic regression,
which uses the q-exponential family for binary response, where $q = t$. In
Section 3, we show that the t-logistic regression converges to the
q-exponential family if $q > 0$.

In Section 4, we study a penalized maximum likelihood estimator on the
q-exponential family of intensity measures. For some special cases, the
estimator is reduced to a known admissible estimator for the Poisson mean
parameter; see Ghosh & Yang (1988).

Some related problems are discussed in Section 5.



2 Imbalanced asymptotics of binomial regression

For each real number $q$, define the q-exponential function by

$$\exp_q(z) = \begin{cases} e^z & \text{if } q = 1, \\ [1 + (1 - q) z]_+^{1/(1-q)} & \text{if } q \neq 1, \end{cases} \tag{4}$$

where $[z]_+ = \max(z, 0)$ and $[0]_+^{-1} = \infty$. This is the inverse of the
Box-Cox transformation. Note that $\exp_q(z) = \infty$ for
$z \ge -1/(1 - q)$ if $q > 1$. The function $\exp_q(z)$ is convex if and only
if $q \ge 0$.
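The definition of the q-exponential function can be written out directly. The sketch below is illustrative only: the names `exp_q` and `log_q` are assumptions, and `log_q` is the Box-Cox-type transform that inverts `exp_q` on the range where it is finite and positive.

```python
import math


def exp_q(z: float, q: float) -> float:
    """q-exponential function: e^z for q = 1, else [1 + (1 - q) z]_+^(1 / (1 - q))."""
    if q == 1.0:
        return math.exp(z)
    base = 1.0 + (1.0 - q) * z
    if base <= 0.0:
        # [0]_+ raised to a positive power is 0 (q < 1);
        # raised to a negative power it is infinite (q > 1).
        return 0.0 if q < 1.0 else math.inf
    return base ** (1.0 / (1.0 - q))


def log_q(y: float, q: float) -> float:
    """Box-Cox-type transform of y > 0, the inverse of exp_q where exp_q is finite and positive."""
    if q == 1.0:
        return math.log(y)
    return (y ** (1.0 - q) - 1.0) / (1.0 - q)


if __name__ == "__main__":
    for q in (0.0, 0.5, 1.0, 2.0):
        print(q, [exp_q(z, q) for z in (-2.0, -0.5, 0.0, 0.5)])
```

For $q = 0$ the function is the truncated linear map $[1 + z]_+$, and for $q = 2$ it is $1/[1 - z]_+$, which blows up at $z = 1$.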

Consider the binomial regression model and put the following assumption on
the distribution function $G$.

Assumption 1. There exist $q \ge 0$, $c_m \in \mathbb{R}$ and $d_m > 0$ such
that

$$G(c_m + d_m z) = \frac{1}{m} \exp_q(z) + o(m^{-1}) \tag{5}$$

as $m \to \infty$ for each $z \in \mathbb{R}$.

In extreme value theory, it is known that there is no asymptotic form other
than (5) as long as it exists; see e.g. de Haan & Ferreira (2006, Theorems
1.1.2 and 1.1.3). The number $q$ controls the lower tail structure of $G$. For
example, the logistic distribution satisfies Assumption 1 with $q = 1$,
$c_m = -\log m$ and $d_m = 1$. Other examples including the normal and Cauchy
distributions are considered in Section 3.
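Assumption 1 can be examined numerically by comparing $m\,G(c_m + d_m z)$ with $\exp_q(z)$. The sketch below does this for the logistic distribution ($q = 1$, $c_m = -\log m$, $d_m = 1$) and for the Cauchy distribution with the sequences $c_m = -m/\pi$, $d_m = m/\pi$ and $q = 2$ listed in Section 3. Function names and the evaluation point $z = 0.5$ are illustrative assumptions.

```python
import math


def exp_q(z: float, q: float) -> float:
    """q-exponential function of Section 2."""
    if q == 1.0:
        return math.exp(z)
    base = 1.0 + (1.0 - q) * z
    if base <= 0.0:
        return 0.0 if q < 1.0 else math.inf
    return base ** (1.0 / (1.0 - q))


def logistic_cdf(x: float) -> float:
    """Logistic distribution function, written to avoid overflow for negative x."""
    if x >= 0.0:
        return 1.0 / (1.0 + math.exp(-x))
    e = math.exp(x)
    return e / (1.0 + e)


def cauchy_cdf(x: float) -> float:
    """Standard Cauchy distribution function 1/2 + atan(x)/pi, stable in the lower tail."""
    return math.atan2(1.0, -x) / math.pi


def scaled_tail(cdf, c_m: float, d_m: float, m: int, z: float) -> float:
    """m * G(c_m + d_m z), which should approach exp_q(z) under Assumption 1."""
    return m * cdf(c_m + d_m * z)


if __name__ == "__main__":
    z = 0.5  # arbitrary evaluation point
    for m in (10**2, 10**4, 10**6):
        logit = scaled_tail(logistic_cdf, -math.log(m), 1.0, m, z)
        cauchy = scaled_tail(cauchy_cdf, -m / math.pi, m / math.pi, m, z)
        print(m, logit, exp_q(z, 1.0), cauchy, exp_q(z, 2.0))
```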

We define

$$a_m(\alpha) = c_m + d_m \alpha \quad \text{and} \quad b_m(\beta) = d_m \beta \tag{6}$$

for $(\alpha, \beta) \in \mathbb{R} \times \mathbb{R}^p$ by using the sequences
$c_m$ and $d_m$ that satisfy (5). Denote the probability law of
$\{(X_i, Y_i)\}_{i=1}^m$ under the true parameter
$(a_m(\alpha), b_m(\beta))$ by $P_{m,\alpha,\beta}$.

Now the asymptotic form like (2) follows from the assumption. Indeed,

$$\begin{aligned} P_{m,\alpha,\beta}(Y_i = 1 \mid X_i) &= G(a_m(\alpha) + b_m(\beta)^{\top} X_i) \\ &= G(c_m + d_m(\alpha + \beta^{\top} X_i)) \\ &= \frac{1}{m} \exp_q(\alpha + \beta^{\top} X_i) + o(m^{-1}). \end{aligned}$$

Therefore, as in the logistic regression, we expect that the binomial
regression model with $G$ converges to the Poisson point process under
Assumption 1. We give a lemma before the main result.

Lemma 1. Let $(\alpha, \beta) \in \mathbb{R} \times \mathbb{R}^p$. Let $A$ be any
compact subset of $\mathbb{R}^p$ such that the function
$\exp_q(\alpha + \beta^{\top} x)$ is finite over $x \in A$. Then the following
equation holds:

$$P_{m,\alpha,\beta}(Y_i = 1, X_i \in A) = \frac{\lambda(A)}{m} + o(m^{-1}), \tag{7}$$

where $\lambda(A) = \int_A \exp_q(\alpha + \beta^{\top} x) F(dx)$.

The proof of Lemma 1 is given in the Appendix.

Theorem 1. Denote the observations $X_i$ for which $Y_i = 1$ by
$\{x_i\}_{i=1}^n$. Then, under $P_{m,\alpha,\beta}$, the set $\{x_i\}_{i=1}^n$
converges in law to the Poisson point process with the intensity measure

$$\lambda(dx) = \exp_q(\alpha + \beta^{\top} x) F(dx) \tag{8}$$

as $m \to \infty$. More precisely, we have

$$\lim_{m \to \infty} P_{m,\alpha,\beta}\bigl(\#\{i \mid x_i \in A_j\} = n_j,\ j = 1, \dots, J\bigr) = \prod_{j=1}^{J} \frac{\lambda(A_j)^{n_j} e^{-\lambda(A_j)}}{n_j!} \tag{9}$$

for any positive integer $J$, non-negative integers $n_j$ and mutually
disjoint compact subsets $A_j$ of $\mathbb{R}^p$ such that
$\exp_q(\alpha + \beta^{\top} x)$ is finite over $x \in A_j$.

The equation (9) is consistent with the definition of weak convergence of
point processes; see Embrechts et al. (1997).

Proof of Theorem 1. Define

$$x(A) = \#\{i \in \{1, \dots, n\} \mid x_i \in A\} = \#\{i \in \{1, \dots, m\} \mid (X_i, Y_i) \in A \times \{1\}\}.$$

Since $\{(X_i, Y_i)\}_{i=1}^m$ is an independent and identically distributed
sequence, the random vector $(x(A_1), \dots, x(A_J))$ for the disjoint compact
subsets $\{A_j\}_{j=1}^J$ is distributed as the multinomial distribution.
Then, by Lemma 1 and Poisson's law of rare events,
$(x(A_1), \dots, x(A_J))$ converges to independent Poisson random variables
with intensity $(\lambda(A_1), \dots, \lambda(A_J))$. The proof is
completed. □

By Theorem 1, the logistic regression model converges to the Poisson point
process model with intensity $\exp(\alpha + \beta^{\top} x) F(dx)$, as
Warton & Shepherd (2010) showed.
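The statement of Theorem 1 for the logit link can be illustrated by simulation. The sketch below is a rough Monte Carlo check under assumptions that are not in the original text: a scalar covariate with $F$ uniform on $[0, 1]$, $\alpha = \beta = 1$, the set $A = [0, 0.5]$, and a fixed random seed. It counts the successes falling in $A$ and compares the sample mean and variance of that count with $\lambda(A)$, which should agree approximately if the count is nearly Poisson.

```python
import math
import random


def simulate_count(m, alpha, beta, a_lo, a_hi, rng):
    """Number of i with Y_i = 1 and X_i in [a_lo, a_hi] in one sample of size m."""
    count = 0
    for _ in range(m):
        x = rng.random()                      # X_i ~ Uniform(0, 1), an assumed F
        u = -math.log(m) + alpha + beta * x   # a_m(alpha) + b_m(beta) x for the logit link
        p = 1.0 / (1.0 + math.exp(-u))        # logistic G(u)
        if rng.random() < p and a_lo <= x <= a_hi:
            count += 1
    return count


def limit_intensity(alpha, beta, a_lo, a_hi):
    """lambda(A) = integral over [a_lo, a_hi] of exp(alpha + beta x) dx, for beta != 0."""
    return math.exp(alpha) * (math.exp(beta * a_hi) - math.exp(beta * a_lo)) / beta


def run(m=500, reps=1000, seed=0):
    """Return (sample mean, sample variance, lambda(A)) of the simulated counts."""
    rng = random.Random(seed)
    counts = [simulate_count(m, 1.0, 1.0, 0.0, 0.5, rng) for _ in range(reps)]
    mean = sum(counts) / reps
    var = sum((c - mean) ** 2 for c in counts) / (reps - 1)
    return mean, var, limit_intensity(1.0, 1.0, 0.0, 0.5)


if __name__ == "__main__":
    print(run())
```

The comparison is only approximate: besides Monte Carlo error, a finite $m$ leaves a small bias of order $1/m$.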



Definition 1. For each $q \in \mathbb{R}$, we call the set of intensity
measures (8) the q-exponential family of intensity measures. Denote the law
of the process $\{x_i\}_{i=1}^n$ with respect to (8) by
$P^{(q)}_{\alpha,\beta}$.

The q-exponential family of intensity measures is closely related to the
q-exponential family of probability measures as follows. Denote the total
intensity by

$$\Lambda_q(\alpha, \beta) = \int_{\mathbb{R}^p} \exp_q(\alpha + \beta^{\top} x) F(dx). \tag{10}$$

Assume $\Lambda_q(\alpha, \beta) < \infty$. Then the likelihood of
$P^{(q)}_{\alpha,\beta}$ is

$$\frac{e^{-\Lambda_q(\alpha, \beta)}}{n!} \prod_{i=1}^{n} \exp_q(\alpha + \beta^{\top} x_i), \tag{11}$$

where the base measure of $n$ is the counting measure on $\{0, 1, \dots\}$, and
the base measure of $x_i$ for each $i$ is the distribution $F(dx_i)$. In (11),
the number $n$ of observed points is marginally distributed according to the
Poisson distribution with intensity $\Lambda_q(\alpha, \beta)$. Each point
$x_i$ is independently distributed according to the q-exponential family
defined by the probability density function

$$\frac{\exp_q(\alpha + \beta^{\top} x_i)}{\Lambda_q(\alpha, \beta)} \tag{12}$$

with respect to $F(dx)$. The q-exponential family is also called the
deformed exponential family or the $\alpha$-family; see Amari & Nagaoka (2000)
for the $\alpha$-family, where $\alpha = 2q - 1$ should be distinguished from
the regression coefficient $\alpha$. It is known that the density (12) is
also written as $\exp_q(\theta^{\top} x_i - \psi_q(\theta))$ with appropriate
$\theta$ and $\psi_q(\theta)$; see e.g. Amari & Ohara (2011). However, we do
not use this parametrization since the quantity $\Lambda_q(\alpha, \beta)$
remains in the whole likelihood (11).

We conjecture that the maximum likelihood estimator of the binomial
regression model $P_{m,\alpha,\beta}$ converges to that of the Poisson process
model $P^{(q)}_{\alpha,\beta}$ under mild conditions. However, we only give
experimental results in Section 3. Instead, we study the estimation problem
of the limit model $P^{(q)}_{\alpha,\beta}$ in Section 4. See also Section 5
for further discussion.



3 Examples

In this section, we give some examples of distributions $G$ satisfying
Assumption 1 and experimental results on the maximum likelihood estimation.

Even if $G$ satisfies Assumption 1, the sequences $c_m$ and $d_m$ are not
uniquely determined. A unified choice is known (see Galambos (1987, Theorems
2.1.4-2.1.6)). However, in the following examples, one of the possible pairs
$(c_m, d_m)$ is explicitly given for each case.

For the logistic distribution and the Gumbel distribution
$G(z) = 1 - \exp(-e^z)$ on minimum values, we have

$$q = 1, \quad c_m = -\log m, \quad d_m = 1. \tag{13}$$

For the standard normal distribution, we have

$$q = 1, \quad c_m = -(2 \log m)^{1/2} + \frac{\log(\log m) + \log(4\pi)}{2 (2 \log m)^{1/2}}, \quad d_m = (2 \log m)^{-1/2}. \tag{14}$$

See e.g. Galambos (1987, Section 2.3.2). For the Cauchy distribution, we
have

$$q = 2, \quad c_m = -m/\pi, \quad d_m = m/\pi. \tag{15}$$

For other examples such as the t-distribution and Pareto distributions,
refer to Galambos (1987) and Embrechts et al. (1997).
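The normalizing sequences for the standard normal distribution can be compared with those of the logistic distribution in terms of how quickly $m\,G(c_m + d_m z)$ approaches $e^z$. The following sketch evaluates the gap at $z = 0$; the function names and the choice of $z$ are illustrative assumptions.

```python
import math


def normal_cdf(x: float) -> float:
    """Standard normal distribution function, accurate in the lower tail."""
    return 0.5 * math.erfc(-x / math.sqrt(2.0))


def normal_sequences(m: int):
    """(c_m, d_m) for the standard normal distribution, as in (14)."""
    r = math.sqrt(2.0 * math.log(m))
    c_m = -r + (math.log(math.log(m)) + math.log(4.0 * math.pi)) / (2.0 * r)
    d_m = 1.0 / r
    return c_m, d_m


def normal_gap(m: int, z: float = 0.0) -> float:
    """|m G(c_m + d_m z) - e^z| for the standard normal G."""
    c_m, d_m = normal_sequences(m)
    return abs(m * normal_cdf(c_m + d_m * z) - math.exp(z))


def logistic_gap(m: int, z: float = 0.0) -> float:
    """|m G(c_m + d_m z) - e^z| for the logistic G with c_m = -log m, d_m = 1."""
    u = -math.log(m) + z
    return abs(m * math.exp(u) / (1.0 + math.exp(u)) - math.exp(z))


if __name__ == "__main__":
    for m in (10**2, 10**4, 10**6):
        print(m, normal_gap(m), logistic_gap(m))
```

In this sketch the normal gap decreases only slowly in $m$, whereas the logistic gap is of order $1/m$.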



We briefly study the t-logistic regression proposed by Ding et al. (2011).
For each real number $t$, let $G_t(z) = \exp_t(z - \gamma_t(z))$, where
$\exp_t$ denotes the q-exponential function with $q = t$ and $\gamma_t(z)$ is
uniquely determined by

$$\exp_t(z - \gamma_t(z)) + \exp_t(-\gamma_t(z)) = 1. \tag{16}$$

We call $G_t(z)$ the t-logistic distribution. Uniqueness of $\gamma_t(z)$
follows from the strict monotonicity of the q-exponential function. The
distribution $G_t(z)$ is symmetric in the sense that $G_t(-z) = 1 - G_t(z)$
since $\gamma_t(-z) = -z + \gamma_t(z)$ by (16). We obtain the following
theorem. The proof is given in the Appendix.

Theorem 2. The t-logistic distribution $G_t$ satisfies Assumption 1 with
$q = \max(t, 0)$.
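The t-logistic distribution has no closed form in general, but $\gamma_t(z)$ can be found numerically from its defining equation. The sketch below uses plain bisection, which is an assumed numerical choice and not necessarily how Ding et al. (2011) compute it; the names `gamma_t` and `t_logistic_cdf` are hypothetical.

```python
import math


def exp_q(z: float, q: float) -> float:
    """q-exponential function of Section 2."""
    if q == 1.0:
        return math.exp(z)
    base = 1.0 + (1.0 - q) * z
    if base <= 0.0:
        return 0.0 if q < 1.0 else math.inf
    return base ** (1.0 / (1.0 - q))


def gamma_t(z: float, t: float) -> float:
    """Solve exp_t(z - g) + exp_t(-g) = 1 for g by bisection."""

    def excess(g: float) -> float:
        return exp_q(z - g, t) + exp_q(-g, t) - 1.0

    lo = max(0.0, z)  # here one term equals exp_t(0) = 1, so excess(lo) >= 0
    width = 1.0
    while excess(lo + width) > 0.0:  # widen the bracket until the sign changes
        width *= 2.0
    hi = lo + width
    for _ in range(200):
        mid = 0.5 * (lo + hi)
        if excess(mid) > 0.0:
            lo = mid
        else:
            hi = mid
    return 0.5 * (lo + hi)


def t_logistic_cdf(z: float, t: float) -> float:
    """G_t(z) = exp_t(z - gamma_t(z))."""
    return exp_q(z - gamma_t(z, t), t)


if __name__ == "__main__":
    for t in (0.0, 1.0, 2.0):
        print(t, [round(t_logistic_cdf(z, t), 6) for z in (-1.5, -0.3, 0.0, 0.3, 1.5)])
```

For $t = 1$ the values should coincide with the logistic distribution, and for $t = 0$ with the uniform distribution on $[-1, 1]$, in line with the special cases used in the proof of Theorem 2.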

Table 1 and Table 2 show the experimental results. The sample is

$$(X_i, Y_i) = \begin{cases} (0.4 + 0.4 (i - 1)/(n - 1),\ 1) & \text{if } i \in \{1, \dots, n\}, \\ ((i - n - 1)/(m - n - 1),\ 0) & \text{if } i \in \{n + 1, \dots, m\} \end{cases} \tag{17}$$

for $n = 10$ and various $m$'s. For the binomial regression models, the
estimated regression coefficient $(a, b)$ is normalized by (6). From Table 1,
the convergence for the probit link is very slow, or the estimate may not
converge at all. For the others, the rate is satisfactory.

Table 1: Comparison of the maximum likelihood estimate of the Poisson point
process model with q = 1 and the binomial regression models. The logit,
probit and cloglog (complementary log-log) link functions are used. The
sample is (17) and n is fixed to 10. The normalizing sequence (c_m, d_m) is
(13) and (14).

   m      Poisson process         logit              probit             cloglog
            α        β          α        β         α        β         α        β
 10^2    1.6504   1.1737     1.6883   1.3067    2.0282   2.3030    1.6975   1.1883
 10^3    1.6277   1.2246     1.6314   1.2373    1.9070   1.8777    1.6322   1.2260
 10^4    1.6256   1.2294     1.6260   1.2307    1.8634   1.6725    1.6260   1.2295
 10^5    1.6254   1.2299     1.6254   1.2300    1.8330   1.5642    1.6254   1.2299



4 Estimation of the q-exponential family of intensity measures

We deal with the estimation problem of the q-exponential family of
intensity measures (8). The maximum likelihood estimator is likely to fail
to exist for small sample size $n$. We propose a penalized maximum
likelihood estimator. We put the following assumption for simplicity.

Assumption 2. The covariate distribution $F(dx)$ is known. The support of
$F$, denoted by $S(F)$, is finite, and is not included in any hyperplane in
$\mathbb{R}^p$. The observable data $\{x_i\}_{i=1}^n$ belongs to $S(F)$.

In practice, $F(dx)$ may be replaced with the empirical, or estimated,
distribution based on the covariate sample $\{X_i\}_{i=1}^m$ of the original
regression problem.

The parameter space is

$$\Theta = \{(\alpha, \beta) \mid 1 + (1 - q)(\alpha + \beta^{\top} x) > 0 \text{ for any } x \in S(F)\}. \tag{18}$$

The set $\Theta$ is convex and unbounded since it is an intersection of half
spaces including the set $\{(\alpha, 0) \mid 1 + (1 - q)\alpha > 0\}$.
Furthermore, $\Theta$ is open since $S(F)$ is compact. In terms of convex
analysis, $\Theta$ corresponds to the polar set of $S(F)$. See Barvinok (2002).






Table 2: Comparison of the maximum likelihood estimate of the Poisson point
process model with q = 2 and the binomial regression model with the cauchit
(inverse of Cauchy) link function. The sample is (17) and n is fixed to 10.
The normalizing sequence (c_m, d_m) is (15).

   m      Poisson process        cauchit
            α        β          α        β
 10^2    0.8662   0.0667     0.8632   0.0656
 10^3    0.8626   0.0673     0.8623   0.0677
 10^4    0.8622   0.0680     0.8622   0.0679
 10^5    0.8621   0.0680     0.8622   0.0679



We consider a penalized log-likelihood function

$$-\Lambda_q(\alpha, \beta) + \sum_{i=1}^{n} \log \exp_q(\alpha + \beta^{\top} x_i) + \kappa \int \log \exp_q(\alpha + \beta^{\top} x) F(dx), \tag{19}$$

where $\kappa$ is a non-negative regularization parameter. If $\kappa = 0$,
(19) is the log-likelihood function; see (11). The penalty term represents
pseudo-data of size $\kappa$ distributed according to $F$. The function (19)
is concave with respect to $(\alpha, \beta)$ if $0 \le q \le 1$. Indeed, we can
directly confirm that $-\exp_q(z)$ is concave if $q \ge 0$, and that
$\log(\exp_q(z))$ is concave if $q \le 1$.

Definition 2. We call the maximizer of (19) the additive-smoothing
estimator.

This estimator has a desirable property as shown in the following example,
even if $q = 1$.

Example 1. Let $F$ be a two-point distribution on $\mathbb{R}$ defined by

$$F(x = 0) = p_0 \quad \text{and} \quad F(x = 1) = p_1,$$

where $p_0, p_1 > 0$ and $p_0 + p_1 = 1$. Denote the intensity at $x = 0$ and
$x = 1$ by $\lambda_0 = p_0 \exp_q(\alpha)$ and
$\lambda_1 = p_1 \exp_q(\alpha + \beta)$, respectively. It is not difficult to
show that $(\alpha, \beta) \in \Theta$ corresponds one-to-one with
$(\lambda_0, \lambda_1) \in \mathbb{R}_+^2$, where $\mathbb{R}_+$ is the set of
positive numbers. Hence the model is equivalent to the independent Poisson
observable model with intensity $(\lambda_0, \lambda_1)$, regardless of $q$.
Then the penalized log-likelihood (19) becomes

$$-\lambda_0 - \lambda_1 + n_0 \log \lambda_0 + n_1 \log \lambda_1 + \kappa \left( p_0 \log \frac{\lambda_0}{p_0} + p_1 \log \frac{\lambda_1}{p_1} \right),$$

where $n_j$ denotes the number of observations $x_i = j$, $j \in \{0, 1\}$. The
additive-smoothing estimator is $\hat{\lambda}_j = n_j + \kappa p_j$,
$j \in \{0, 1\}$. If $\kappa > 0$, then
$(\hat{\lambda}_0, \hat{\lambda}_1) \in \mathbb{R}_+^2$ and the estimator
$(\hat{\alpha}, \hat{\beta})$ always exists. Furthermore, if
$0 < \kappa < 1$, this estimator is known to be admissible with respect to
the Kullback-Leibler loss function; see Ghosh & Yang (1988, Theorem 1). For
the same reason, if $S(F)$ has only $p + 1$ points in $\mathbb{R}^p$, then the
additive-smoothing estimator is admissible as long as $0 < \kappa < 1$.
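The closed form of Example 1 can be checked against the penalized log-likelihood directly. The sketch below evaluates the penalized log-likelihood for a scalar covariate and a finite-support $F$, then verifies for $q = 1$ that the point obtained from $\hat{\lambda}_j = n_j + \kappa p_j$ is not beaten by nearby parameter values. The function names, the data and the values of $p_0$, $p_1$, $\kappa$ are illustrative assumptions.

```python
import math


def exp_q(z: float, q: float) -> float:
    """q-exponential function of Section 2."""
    if q == 1.0:
        return math.exp(z)
    base = 1.0 + (1.0 - q) * z
    if base <= 0.0:
        return 0.0 if q < 1.0 else math.inf
    return base ** (1.0 / (1.0 - q))


def penalized_loglik(alpha, beta, q, support, probs, data, kappa):
    """Penalized log-likelihood for scalar covariates and a finite-support F."""
    intensities = [exp_q(alpha + beta * x, q) for x in support]
    total = sum(p * v for p, v in zip(probs, intensities))        # total intensity
    data_term = sum(math.log(exp_q(alpha + beta * x, q)) for x in data)
    penalty = kappa * sum(p * math.log(v) for p, v in zip(probs, intensities))
    return -total + data_term + penalty


def additive_smoothing_two_point(n0, n1, p0, p1, kappa):
    """Closed form of Example 1 for q = 1, mapped back to (alpha, beta)."""
    lam0 = n0 + kappa * p0
    lam1 = n1 + kappa * p1
    alpha = math.log(lam0 / p0)          # lam0 = p0 * exp(alpha)
    beta = math.log(lam1 / p1) - alpha   # lam1 = p1 * exp(alpha + beta)
    return alpha, beta


if __name__ == "__main__":
    data = [0.0, 0.0]  # two observations at x = 0, none at x = 1
    a_hat, b_hat = additive_smoothing_two_point(2, 0, 0.7, 0.3, kappa=0.5)
    print(a_hat, b_hat)
    print(penalized_loglik(a_hat, b_hat, 1.0, [0.0, 1.0], [0.7, 0.3], data, 0.5))
```

In the usage above no observation falls at $x = 1$, so the plain maximum likelihood estimate would push $\lambda_1$ to zero, while the penalized version stays finite.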

Let $q = 1$ and $F$ be any distribution satisfying Assumption 2. Then,
since the model (11) is an exponential family, the pair $(n, \bar{x}_n)$ is a
sufficient statistic, where $\bar{x}_n = n^{-1} \sum_{i=1}^n x_i$ is the sample
mean. Indeed, the additive-smoothing estimator should satisfy

$$\Lambda_1(\hat{\alpha}, \hat{\beta}) = n + \kappa \quad \text{and} \quad \frac{\int x e^{\hat{\beta}^{\top} x} F(dx)}{\int e^{\hat{\beta}^{\top} x} F(dx)} = \frac{n \bar{x}_n + \kappa \int x F(dx)}{n + \kappa}. \tag{20}$$

For the maximum likelihood estimator, meaning $\kappa = 0$, the second
equation of (20) is consistent with the result of Owen (2007). From the
theory of exponential families, the solution to (20) always exists if
$\kappa > 0$ since $\int x F(dx)$ belongs to the interior of the convex hull
of $S(F)$; see Barndorff-Nielsen (1978, Corollary 9.6). On the other hand,
the maximum likelihood estimator fails to exist if $\bar{x}_n$ is a boundary
point of the convex hull of $S(F)$.
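For a scalar covariate, the two equations for $q = 1$ can be solved by a one-dimensional search. The sketch below applies this to the sample (17) with $\kappa = 0$, under the assumption that $F$ is replaced by the empirical distribution of all $m$ covariates, as the remark after Assumption 2 permits; the text does not state which $F$ was used in the experiments, and the function names are hypothetical. The output for $m = 10^3$ can be compared with the Poisson process column of Table 1.

```python
import math


def sample_17(m: int, n: int = 10):
    """Covariates of the sample (17): n successes, then m - n failures."""
    x_success = [0.4 + 0.4 * (i - 1) / (n - 1) for i in range(1, n + 1)]
    x_failure = [(i - n - 1) / (m - n - 1) for i in range(n + 1, m + 1)]
    return x_success, x_failure


def tilted_mean(beta: float, xs) -> float:
    """Mean of x under the law proportional to exp(beta x) F(dx), F empirical on xs."""
    weights = [math.exp(beta * x) for x in xs]
    return sum(w * x for w, x in zip(weights, xs)) / sum(weights)


def solve_q1(x_obs, x_all, kappa: float = 0.0, lo: float = -50.0, hi: float = 50.0):
    """Solve the q = 1 estimating equations by bisection on beta (scalar covariate)."""
    n = len(x_obs)
    mean_f = sum(x_all) / len(x_all)
    target = (sum(x_obs) + kappa * mean_f) / (n + kappa)
    for _ in range(200):
        mid = 0.5 * (lo + hi)
        if tilted_mean(mid, x_all) < target:  # tilted mean is increasing in beta
            lo = mid
        else:
            hi = mid
    beta = 0.5 * (lo + hi)
    mgf = sum(math.exp(beta * x) for x in x_all) / len(x_all)
    alpha = math.log((n + kappa) / mgf)       # total intensity equals n + kappa
    return alpha, beta


if __name__ == "__main__":
    x_success, x_failure = sample_17(10**3)
    print(solve_q1(x_success, x_success + x_failure))
```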

For $q \neq 1$, we provide a similar result on existence. First consider the
following example. The pair $(n, \bar{x}_n)$ is not a sufficient statistic any
more.

Example 2. Let $q = 0$ and $F$ be a three-point distribution on $\mathbb{R}$
defined by $F(x = j) = 1/3$ for $j \in \{0, 1, 2\}$. Denote the number of
observations $x_i = j$ by $n_j$. We use $\theta = 1 + \alpha$ and
$\phi = 1 + \alpha + 2\beta$ as a new parameter. Then the parameter space is
$\theta > 0$ and $\phi > 0$. The penalized log-likelihood is

$$-\frac{\theta + \phi}{2} + n_0^* \log \theta + n_1^* \log \frac{\theta + \phi}{2} + n_2^* \log \phi, \tag{21}$$

where $n_j^* = n_j + \kappa/3$. The maximizer $(\hat{\theta}, \hat{\phi})$ of
(21) is

$$\hat{\theta} = \frac{2 n_0^* (n_0^* + n_1^* + n_2^*)}{n_0^* + n_2^*}, \quad \hat{\phi} = \frac{2 n_2^* (n_0^* + n_1^* + n_2^*)}{n_0^* + n_2^*}.$$

This always belongs to the parameter space if $\kappa > 0$. On the other
hand, the maximum likelihood estimator fails to exist if $n_0 = 0$ or
$n_2 = 0$.
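The closed-form maximizer in Example 2 can be verified numerically. The sketch below evaluates the penalized log-likelihood of the example in the $(\theta, \phi)$ parametrization, computes the stated maximizer, and checks that a finite-difference gradient vanishes there. The counts and the value of $\kappa$ are illustrative assumptions, as are the function names.

```python
import math


def penalized_loglik_21(theta: float, phi: float, counts, kappa: float) -> float:
    """Penalized log-likelihood of Example 2 in terms of (theta, phi)."""
    n_star = [n_j + kappa / 3.0 for n_j in counts]
    mid = 0.5 * (theta + phi)  # intensity factor at x = 1 and also the total intensity
    return (
        -mid
        + n_star[0] * math.log(theta)
        + n_star[1] * math.log(mid)
        + n_star[2] * math.log(phi)
    )


def maximizer_21(counts, kappa: float):
    """Closed-form maximizer (theta_hat, phi_hat) stated in Example 2."""
    s0, s1, s2 = (n_j + kappa / 3.0 for n_j in counts)
    scale = 2.0 * (s0 + s1 + s2) / (s0 + s2)
    return scale * s0, scale * s2


def to_alpha_beta(theta: float, phi: float):
    """Invert theta = 1 + alpha and phi = 1 + alpha + 2 beta."""
    return theta - 1.0, 0.5 * (phi - theta)


if __name__ == "__main__":
    counts, kappa = (0, 3, 2), 0.6  # n_0 = 0, so the unpenalized maximizer does not exist
    theta_hat, phi_hat = maximizer_21(counts, kappa)
    print(theta_hat, phi_hat, to_alpha_beta(theta_hat, phi_hat))
```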

In general, the following theorem holds. The proof is given in the Appendix.

Theorem 3. Let $q$ be any real number and $\kappa > 0$. If Assumption 2 is
satisfied, then the additive-smoothing estimator exists almost surely. It is
unique if $0 \le q \le 1$.



5 Discussion 

5.1 Multinomial regression 

We have so far studied binomial regression. There are variants of
multinomial regression models. The multinomial t-logistic regression
proposed by Ding et al. (2011) can be proved to have a limit under
imbalanced asymptotics in the same manner as Theorem 2. The author is not
aware of more general results. The problem is left as future work.

5.2 Convergence of estimator 

We did not study convergence properties of estimators such as the maximum
likelihood estimator. Instead we considered the additive-smoothing estimator
for the q-exponential family of intensity measures in Section 4.

Owen (2007) showed that the maximum likelihood estimator of the logistic
regression converges to that of the exponential family under imbalanced
asymptotics. Then a natural conjecture is that the maximum likelihood
estimator of the binomial regression model, which is the maximizer of

$$\sum_{i=1}^{m} \left[ Y_i \log G(a + b^{\top} X_i) + (1 - Y_i) \log\{1 - G(a + b^{\top} X_i)\} \right],$$

converges to that of the q-exponential family. Note that estimation of
$(a, b)$ is equivalent to that of $(\alpha, \beta)$ via the formula (6). It
will also be meaningful to study convergence of statistical experiments; see
van der Vaart (1998) for the terminology.

An estimator corresponding to the additive-smoothing estimator of
Definition 2 is the maximizer of

$$\sum_{i=1}^{m} \left[ Y_i \log G(a + b^{\top} X_i) + (1 - Y_i) \log\{1 - G(a + b^{\top} X_i)\} \right] + \frac{\kappa}{m} \sum_{i=1}^{m} \log\{m G(a + b^{\top} X_i)\}$$

since the additional term converges to
$\kappa \int \log \exp_q(\alpha + \beta^{\top} x) F(dx)$ after the normalization
(6). The estimator is expected to converge as well.
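The convergence of the additional penalty term can be illustrated for the logit link, where $\log\{m\,G(-\log m + u)\} = u - \log(1 + e^u/m)$. The sketch below compares the finite-$m$ penalty with its limit $\kappa \int (\alpha + \beta x) F(dx)$, taking $F$ to be the empirical distribution of an equally spaced grid on $[0, 1]$. The grid, the parameter values and the function names are illustrative assumptions.

```python
import math


def log_m_times_logistic(u: float, m: int) -> float:
    """log{m G(-log m + u)} for the logistic G, i.e. u - log(1 + e^u / m)."""
    return u - math.log1p(math.exp(u) / m)


def penalty_term(alpha: float, beta: float, xs, kappa: float) -> float:
    """(kappa / m) * sum_i log{m G(a + b X_i)} at (a, b) = (a_m(alpha), b_m(beta))."""
    m = len(xs)
    return kappa / m * sum(log_m_times_logistic(alpha + beta * x, m) for x in xs)


def limit_penalty(alpha: float, beta: float, xs, kappa: float) -> float:
    """kappa * integral of log exp_1(alpha + beta x) F(dx), F empirical on xs."""
    return kappa * sum(alpha + beta * x for x in xs) / len(xs)


if __name__ == "__main__":
    for m in (10**2, 10**3, 10**4):
        xs = [i / (m - 1) for i in range(m)]  # equally spaced covariates on [0, 1]
        print(m, penalty_term(1.0, 1.0, xs, 1.0), limit_penalty(1.0, 1.0, xs, 1.0))
```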

5.3 Misspecified case 

We studied asymptotic properties of the binomial regression model under an
assumption that the model (1) is true. On the other hand, Owen (2007) put a
different assumption, in that the true conditional distribution of the
covariate $X_i$ given $Y_i = j$, $j \in \{0, 1\}$, is fixed to some distribution
$F_j$. In this assumption, our setting is asymptotically described as
$F_0(dx) = F(dx)$ and
$F_1(dx) = \{\exp_q(\alpha + \beta^{\top} x)/\Lambda_q(\alpha, \beta)\} F(dx)$ by
(11). In other words, if the true distributions $F_j$ do not satisfy this
relation, the model is misspecified.

It is important to consider robustness of estimators under the misspecified
assumption. The problem is not so serious if the support of $F_1$ is included
in that of $F$, since then $F_1$ is absolutely continuous with respect to the
estimated intensity measure
$\exp_q(\hat{\alpha} + \hat{\beta}^{\top} x) F(dx)$, whenever
$(\hat{\alpha}, \hat{\beta})$ belongs to the parameter space (18). Otherwise,
however, $F_1$ is not absolutely continuous. In other words, the estimated
intensity measure does not allow the future data $x_{n+1}$ to fall into some
region. In particular, if the support of $F_1$ is not assumed a priori, there
is a risk of such a contradiction.

One may consider taking a distribution $F$ with the full support
$\mathbb{R}^p$ in order to contain the support of $F_1$. However, if
$q \neq 1$, we cannot assume such a distribution $F$ since the parameter space
(18) becomes $\{(\alpha, 0) \mid 1 + (1 - q)\alpha > 0\}$.

A solution to this problem will be to use a parametric family of $F$
together with a Bayesian prior distribution. For example, let
$F(dx) = F(dx \mid \theta)$ be the uniform distribution on the hypercube
$[-\theta, \theta]^p$, and assume a prior density on $\theta > 0$. As long as
the true $F_1(dx)$ has compact support, we have a chance to detect it since
there is a sufficiently large $\theta$ such that the support of $F_1$ is
included in that of $F(\cdot \mid \theta)$.

5.4 Bayesian prediction

In the preceding subsection, we considered the Bayesian approach for
treating the misspecified case. Even if the model is correctly specified,
the approach will be fruitful.

In Section 4, we considered the additive-smoothing estimator of
$(\alpha, \beta)$. This is considered as a maximum-a-posteriori estimator if
the prior density

$$\pi(\alpha, \beta) = \exp\left( \kappa \int \log \exp_q(\alpha + \beta^{\top} x) F(dx) \right)$$

is adopted. Then additive-smoothing Bayesian prediction can also be defined
by the same prior.

In Example 1, we noted that, for special cases of $F$ and $\kappa$, the
additive-smoothing estimator becomes an admissible estimator with respect to
the Kullback-Leibler divergence, as shown by Ghosh & Yang (1988). For the
prediction problem, a class of admissible predictive densities is
investigated by Komaki (2004). Together with the additive-smoothing
estimator, decision-theoretic properties of the additive-smoothing
prediction are of interest.

A Appendix

A.1 Proof of Lemma 1

Denote the induced probability distribution of $t = \alpha + \beta^{\top} X_i$
by $F^*(dt)$. Let $A^*$ be $A^* = \{\alpha + \beta^{\top} x \mid x \in A\}$. Then
$A^*$ is compact since $A$ is. We have

$$\begin{aligned} P_{m,\alpha,\beta}(Y_i = 1, X_i \in A) &= \int_A G(a_m(\alpha) + b_m(\beta)^{\top} x) F(dx) \\ &= \int_A G(c_m + d_m(\alpha + \beta^{\top} x)) F(dx) \\ &= \int_{A^*} G(c_m + d_m t) F^*(dt). \end{aligned}$$

To prove (7), it is enough to show that

$$\int_{A^*} G(c_m + d_m t) F^*(dt) = \frac{1}{m} \int_{A^*} \exp_q(t) F^*(dt) + o(m^{-1}).$$

By Assumption 1, we know $m G(c_m + d_m t) = \exp_q(t) + o(1)$ for each
$t \in A^*$. Hence it is enough to show that $m G(c_m + d_m t)$ converges to
$\exp_q(t)$ uniformly in $t \in A^*$. However, since $m G(c_m + d_m t)$ is
monotone in $t$ and $\exp_q(t)$ is continuous in $t \in A^*$, uniform
convergence follows from the general argument; see e.g. Galambos (1987,
Lemma 2.10.1).



A.2 Proof of Theorem 2

For each real number $q$, denote the set of distributions that satisfy
Assumption 1 by $\mathcal{D}_q$.

For $t = 1$, elementary calculation shows that $G_1(z) = e^z/(1 + e^z)$. This
is the logistic distribution and belongs to $\mathcal{D}_1$.

For $t = 0$, we have $G_0(z) = (1 + z)/2$, $-1 \le z \le 1$. This is the uniform
distribution on $[-1, 1]$ and belongs to $\mathcal{D}_0$.

Let $t > 1$. It suffices to show that

$$G_t(z) = [(1 - t) z]^{1/(1-t)} + o((-z)^{1/(1-t)}), \quad z \to -\infty.$$

Indeed, by the condition (16), if $z \to -\infty$, then $\gamma_t(z) \to 0$.
Thus

$$\begin{aligned} \exp_t(z - \gamma_t(z)) &= \exp_t(z + o(z)) \\ &= [1 + (1 - t)(z + o(z))]^{1/(1-t)} \\ &= [(1 - t) z]^{1/(1-t)} + o((-z)^{1/(1-t)}). \end{aligned}$$

Hence $G_t$ belongs to $\mathcal{D}_t$.

For $t < 1$, we first show that the support of $G_t$ has the infimum
$z_* = -1/(1 - t)$ and that $\gamma_t(z)$ tends to 0 as $z \to z_* + 0$. Note
that the t-exponential function $\exp_t(z)$ is continuous in
$z \in \mathbb{R}$, strictly increasing over $z > z_*$, and remains 0 over
$z \le z_*$. Since $\exp_t(z) > 1$ for any $z > 0$, it must be
$\gamma_t(z) \ge 0$ for any $z \in \mathbb{R}$ by (16). Then
$\exp_t(z - \gamma_t(z)) > 0$ only if $z > z_*$. Conversely, if $z > z_*$, it
must be $\exp_t(z - \gamma_t(z)) > 0$. Indeed, if
$\exp_t(z - \gamma_t(z)) = 0$, then $\gamma_t(z) = 0$ by (16), but this
contradicts $z > z_*$. To prove $\gamma_t(z) \to 0$ as $z \to z_* + 0$, due to
(16), it is sufficient to show that $\exp_t(z - \gamma_t(z)) \to 0$ as
$z \to z_* + 0$. This is shown as

$$0 \le \exp_t(z - \gamma_t(z)) \le \exp_t(z) \to 0, \quad z \to z_* + 0.$$

Let $0 < t < 1$ and $z_* = -1/(1 - t)$. It suffices to show that

$$G_t(z) = [(1 - t)(z - z_*)]^{1/(1-t)} + o((z - z_*)^{1/(1-t)}), \quad z \to z_* + 0. \tag{22}$$

By the definition of $z_*$, we have

$$\exp_t(z - \gamma_t(z)) = [1 + (1 - t)(z - \gamma_t(z))]^{1/(1-t)} = [(1 - t)(z - z_* - \gamma_t(z))]^{1/(1-t)}. \tag{23}$$

On the other hand, since $\gamma_t(z) \to 0$ as $z \to z_* + 0$, we obtain

$$\exp_t(-\gamma_t(z)) = 1 - \gamma_t(z) + o(\gamma_t(z)). \tag{24}$$

By substituting the two equations into (16), we obtain
$\gamma_t(z) = O((z - z_*)^{1/(1-t)}) = o(z - z_*)$. Then (23) implies (22).
Hence $G_t$ belongs to $\mathcal{D}_t$.

Finally, let $t < 0$ and $z_* = -1/(1 - t)$. We show that $G_t$ belongs to
$\mathcal{D}_0$, not $\mathcal{D}_t$. It suffices to show that

$$G_t(z) = (z - z_*) + o(z - z_*), \quad z \to z_* + 0. \tag{25}$$

For the same reason as the case $0 < t < 1$, we have the two equations (23)
and (24). By substituting them into (16), we obtain

$$\gamma_t(z) = (z - z_*) - \frac{(z - z_*)^{1-t}}{1 - t} + o((z - z_*)^{1-t}).$$

Then (23) implies (25). Hence $G_t$ belongs to $\mathcal{D}_0$.

A.3 Proof of Theorem 3

Uniqueness follows from concavity of (19) for $0 \le q \le 1$. We prove the
existence result. Since the case $q = 1$ is proved in (20), we assume
$q \neq 1$.

In the following, we prove the theorem only for the case that $n = 0$, that
is, no data is observed. The case $n \ge 1$ is similarly proved if one notes
that $\{x_i\}_{i=1}^n$ is contained in the convex hull of the support of $F$.

Let $F$ be a discrete distribution with support
$\{\xi_j\}_{j=1}^J \subset \mathbb{R}^p$ and put $p_j = F(x = \xi_j) > 0$,
$j \in \{1, \dots, J\}$. By assumption, $\{\xi_j\}_{j=1}^J$ is not included in
any hyperplane of $\mathbb{R}^p$. The parameter space (18) is written as

$$\Theta = \{(\alpha, \beta) \mid 1 + (1 - q)(\alpha + \beta^{\top} \xi_j) > 0,\ j \in \{1, \dots, J\}\}.$$

Note that $\Theta$ is an open convex set and the origin
$(\alpha, \beta) = (0, 0)$ always belongs to $\Theta$. The penalized
log-likelihood is, since $n = 0$,

$$L(\alpha, \beta) = \sum_{j=1}^{J} p_j \left\{ -\exp_q(\alpha + \beta^{\top} \xi_j) + \kappa \log \exp_q(\alpha + \beta^{\top} \xi_j) \right\}. \tag{26}$$

By continuity of $L(\alpha, \beta)$ over $\Theta$, it is sufficient to show that
$L(\alpha, \beta) \to -\infty$ if $(\alpha, \beta)$ tends to a boundary point of
$\Theta$ or $(\alpha, \beta)$ diverges. Note that if $(\alpha_0, \beta_0)$ is a
boundary point of $\Theta$, then $(t\alpha_0, t\beta_0)$ belongs to $\Theta$ for
any $0 \le t < 1$ since the origin does.

We prove the claim for $q < 1$ first, and then $q > 1$.

Let $q < 1$. Fix any boundary point $(\alpha_0, \beta_0)$ of $\Theta$. Then
there is at least one $\xi_j$ such that
$\exp_q(\alpha_0 + \beta_0^{\top} \xi_j) = 0$. For such $\xi_j$'s,
$\exp_q(t(\alpha_0 + \beta_0^{\top} \xi_j)) \to +0$ as $t \to 1 - 0$. For the
other $\xi_j$'s, $\exp_q(t(\alpha_0 + \beta_0^{\top} \xi_j))$ is bounded as
$t \to 1 - 0$. Then, by (26), the function $L(t\alpha_0, t\beta_0)$ tends to
$-\infty$ as $t \to 1 - 0$.

Let $q < 1$ and fix any $(\alpha_1, \beta_1) \in \Theta \setminus \{(0, 0)\}$
such that $(t\alpha_1, t\beta_1) \in \Theta$ for any $t > 0$. Then it is
necessary that $\alpha_1 + \beta_1^{\top} \xi_j \ge 0$ for all $j$. Since
$\{\xi_j\}$ is not contained in a hyperplane, there is at least one $\xi_j$
such that $\alpha_1 + \beta_1^{\top} \xi_j > 0$. For such $\xi_j$'s, we have
$\exp_q(t\alpha_1 + t\beta_1^{\top} \xi_j) \to \infty$ as $t \to \infty$. For the
other $\xi_j$'s, $\exp_q(t\alpha_1 + t\beta_1^{\top} \xi_j) = \exp_q(0) = 1$.
Therefore, by (26), the function $L(t\alpha_1, t\beta_1)$ tends to $-\infty$ as
$t \to \infty$, and the case $q < 1$ is completed.

Let $q > 1$. Fix any boundary point $(\alpha_0, \beta_0)$ of $\Theta$. Then
there is at least one $\xi_j$ such that
$\exp_q(\alpha_0 + \beta_0^{\top} \xi_j) = \infty$. For such $\xi_j$'s,
$\exp_q(t(\alpha_0 + \beta_0^{\top} \xi_j)) \to \infty$ as $t \to 1 - 0$. For the
other $\xi_j$'s, $\exp_q(t(\alpha_0 + \beta_0^{\top} \xi_j))$ is bounded as
$t \to 1 - 0$. Then, by (26), the function $L(t\alpha_0, t\beta_0)$ tends to
$-\infty$ as $t \to 1 - 0$.

Finally, let $q > 1$ and fix any
$(\alpha_1, \beta_1) \in \Theta \setminus \{(0, 0)\}$ such that
$(t\alpha_1, t\beta_1) \in \Theta$ for any $t > 0$. Then it is necessary that
$\alpha_1 + \beta_1^{\top} \xi_j \le 0$ for all $j$. Since $\{\xi_j\}$ is not
contained in a hyperplane, there is at least one $\xi_j$ such that
$\alpha_1 + \beta_1^{\top} \xi_j < 0$. For such $\xi_j$'s,
$\exp_q(t\alpha_1 + t\beta_1^{\top} \xi_j) \to +0$ as $t \to \infty$. For the
other $\xi_j$'s, $\exp_q(t\alpha_1 + t\beta_1^{\top} \xi_j) = \exp_q(0) = 1$.
Therefore, by (26), the function $L(t\alpha_1, t\beta_1)$ tends to $-\infty$ as
$t \to \infty$, and the case $q > 1$ is completed.